Overview

Dataset statistics

Number of variables: 12
Number of observations: 2891
Missing cells: 48
Missing cells (%): 0.1%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 271.2 KiB
Average record size in memory: 96.0 B
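
As a rough cross-check of the figures above, the missing-cell percentage can be recomputed from the per-variable missing counts listed in the Variables section of this report. The sketch below is illustrative only: the variable names are made up for the example, and it is not necessarily how pandas-profiling derives the value internally.

```python
# Cross-check of the overview figures using numbers quoted in this report.
n_observations = 2891
n_variables = 12

# Missing counts per variable, copied from the Variables section.
# Variables that report 0 missing values are omitted.
missing_by_variable = {
    "Male Population": 3,
    "Female Population": 3,
    "Number of Veterans": 13,
    "Foreign-born": 13,
    "Average Household Size": 16,
}

missing_cells = sum(missing_by_variable.values())
total_cells = n_observations * n_variables
missing_pct = 100 * missing_cells / total_cells

print(f"Missing cells: {missing_cells}")
print(f"Missing cells (%): {missing_pct:.1f}%")
```

The per-variable counts add up to the 48 missing cells reported above, and 48 out of 34692 cells rounds to 0.1%.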

Variable types

Numeric (NUM): 8
Categorical (CAT): 4

Reproduction

Analysis started: 2020-06-14 09:59:56.071474
Analysis finished: 2020-06-14 10:00:09.067661
Duration: 13 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]

Warnings

City has a high cardinality: 567 distinct values (High cardinality)
Female Population is highly correlated with Male Population and 2 other fields (High correlation)
Male Population is highly correlated with Female Population and 2 other fields (High correlation)
Total Population is highly correlated with Male Population and 2 other fields (High correlation)
Foreign-born is highly correlated with Male Population and 2 other fields (High correlation)
State Code is highly correlated with State (High correlation)
State is highly correlated with State Code (High correlation)
City is uniformly distributed (Uniform)

Variables

City
Categorical

HIGH CARDINALITY
UNIFORM

Distinct count: 567
Unique (%): 19.6%
Missing: 0
Missing (%): 0.0%
Memory size: 22.6 KiB

Value | Count | Frequency (%)
Bloomington | 15 | 0.5%
Springfield | 15 | 0.5%
Columbia | 15 | 0.5%
Allen | 10 | 0.3%
Norwalk | 10 | 0.3%
Richmond | 10 | 0.3%
Jackson | 10 | 0.3%
Aurora | 10 | 0.3%
Glendale | 10 | 0.3%
Kansas City | 10 | 0.3%
Other values (557) | 2776 | 96.0%

Length

Max length: 47
Median length: 9
Mean length: 9.10307852
Min length: 2

State
Categorical

HIGH CORRELATION

Distinct count: 49
Unique (%): 1.7%
Missing: 0
Missing (%): 0.0%
Memory size: 22.6 KiB

Value | Count | Frequency (%)
California | 676 | 23.4%
Texas | 273 | 9.4%
Florida | 222 | 7.7%
Illinois | 91 | 3.1%
Washington | 85 | 2.9%
Arizona | 80 | 2.8%
Colorado | 80 | 2.8%
Michigan | 79 | 2.7%
Virginia | 70 | 2.4%
North Carolina | 70 | 2.4%
Other values (39) | 1165 | 40.3%

Length

Max length: 20
Median length: 8
Mean length: 8.426841923
Min length: 4

Median Age
Real number (ℝ≥0)

Distinct count: 180
Unique (%): 6.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 35.49488066413006
Minimum: 22.9
Maximum: 70.5
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 22.9
5-th percentile: 28.8
Q1: 32.8
median: 35.3
Q3: 38
95-th percentile: 42.6
Maximum: 70.5
Range: 47.6
Interquartile range (IQR): 5.2

Descriptive statistics

Standard deviation: 4.40161673
Coefficient of variation (CV): 0.1240070863
Kurtosis: 4.164544019
Mean: 35.49488066
Median Absolute Deviation (MAD): 2.6
Skewness: 0.6466184403
Sum: 102615.7
Variance: 19.37422984
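
Several of the Median Age statistics above are tied together by simple identities: the coefficient of variation is the standard deviation divided by the mean, the variance is the squared standard deviation, the range is maximum minus minimum, and the IQR is Q3 minus Q1. The following minimal sketch recomputes them from the values printed in this report; the variable names are illustrative and the inputs are the rounded figures shown above, so small rounding differences are expected.

```python
# Values copied from the Median Age statistics in this report.
mean = 35.49488066
std = 4.40161673
minimum, maximum = 22.9, 70.5
q1, q3 = 32.8, 38.0

cv = std / mean                  # coefficient of variation
variance = std ** 2              # variance is the squared standard deviation
value_range = maximum - minimum  # range
iqr = q3 - q1                    # interquartile range

print(f"CV: {cv:.6f}")
print(f"Variance: {variance:.4f}")
print(f"Range: {value_range:.1f}")
print(f"IQR: {iqr:.1f}")
```

The same identities hold for the other numeric variables in this report, which makes them a quick sanity check when copying figures by hand.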
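Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
35.7 | 50 | 1.7%
33.4 | 48 | 1.7%
36.8 | 45 | 1.6%
34.1 | 45 | 1.6%
33.1 | 45 | 1.6%
34.5 | 44 | 1.5%
38.1 | 42 | 1.5%
35 | 40 | 1.4%
35.3 | 40 | 1.4%
35.5 | 40 | 1.4%
Other values (170) | 2452 | 84.8%

Smallest values

Value | Count | Frequency (%)
22.9 | 5 | 0.2%
23 | 4 | 0.1%
23.5 | 5 | 0.2%
23.6 | 5 | 0.2%
23.9 | 5 | 0.2%

Largest values

Value | Count | Frequency (%)
70.5 | 3 | 0.1%
48.8 | 5 | 0.2%
47.9 | 4 | 0.1%
47.6 | 4 | 0.1%
47.4 | 5 | 0.2%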

Male Population
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 593
Unique (%): 20.5%
Missing: 3
Missing (%): 0.1%
Infinite: 0
Infinite (%): 0.0%
Mean: 97328.42624653739
Minimum: 29281.0
Maximum: 4081698.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 29281
5-th percentile: 32290
Q1: 39289
median: 52341
Q3: 86641.75
95-th percentile: 296902.6
Maximum: 4081698
Range: 4052417
Interquartile range (IQR): 47352.75

Descriptive statistics

Standard deviation: 216299.9369
Coefficient of variation (CV): 2.222371667
Kurtosis: 209.8137883
Mean: 97328.42625
Median Absolute Deviation (MAD): 15991
Skewness: 12.73559694
Sum: 281084495
Variance: 4.678566272e+10

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
40601 | 10 | 0.3%
33993 | 10 | 0.3%
34311 | 5 | 0.2%
264893 | 5 | 0.2%
40282 | 5 | 0.2%
65113 | 5 | 0.2%
38827 | 5 | 0.2%
48984 | 5 | 0.2%
1149686 | 5 | 0.2%
55639 | 5 | 0.2%
Other values (583) | 2828 | 97.8%

Smallest values

Value | Count | Frequency (%)
29281 | 5 | 0.2%
29995 | 5 | 0.2%
30007 | 5 | 0.2%
30193 | 4 | 0.1%
30758 | 5 | 0.2%

Largest values

Value | Count | Frequency (%)
4081698 | 5 | 0.2%
1958998 | 5 | 0.2%
1320015 | 5 | 0.2%
1149686 | 5 | 0.2%
786833 | 5 | 0.2%

Female Population
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 594
Unique (%): 20.6%
Missing: 3
Missing (%): 0.1%
Infinite: 0
Infinite (%): 0.0%
Mean: 101769.63088642659
Minimum: 27348.0
Maximum: 4468707.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 27348
5-th percentile: 34163
Q1: 41227
median: 53809
Q3: 89604
95-th percentile: 315853.35
Maximum: 4468707
Range: 4441359
Interquartile range (IQR): 48377

Descriptive statistics

Standard deviation: 231564.5726
Coefficient of variation (CV): 2.2753799
Kurtosis: 227.6330518
Mean: 101769.6309
Median Absolute Deviation (MAD): 15771
Skewness: 13.32044486
Sum: 293910694
Variance: 5.362215127e+10

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
35801 | 10 | 0.3%
97350 | 5 | 0.2%
88232 | 5 | 0.2%
508602 | 5 | 0.2%
60664 | 5 | 0.2%
56151 | 5 | 0.2%
36182 | 5 | 0.2%
118443 | 5 | 0.2%
44199 | 5 | 0.2%
96944 | 5 | 0.2%
Other values (584) | 2833 | 98.0%

Smallest values

Value | Count | Frequency (%)
27348 | 5 | 0.2%
31238 | 4 | 0.1%
31456 | 4 | 0.1%
32173 | 3 | 0.1%
32397 | 5 | 0.2%

Largest values

Value | Count | Frequency (%)
4468707 | 5 | 0.2%
2012898 | 5 | 0.2%
1400541 | 5 | 0.2%
1148942 | 5 | 0.2%
826172 | 5 | 0.2%

Total Population
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 594
Unique (%): 20.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 198966.77931511588
Minimum: 63215
Maximum: 8550405
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 63215
5-th percentile: 67271
Q1: 80429
median: 106782
Q3: 175232
95-th percentile: 618619
Maximum: 8550405
Range: 8487190
Interquartile range (IQR): 94803

Descriptive statistics

Standard deviation: 447555.9296
Coefficient of variation (CV): 2.249400283
Kurtosis: 219.2158805
Mean: 198966.7793
Median Absolute Deviation (MAD): 32640
Skewness: 13.04462299
Sum: 575212959
Variance: 2.003063102e+11

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
68097 | 10 | 0.3%
71024 | 10 | 0.3%
108012 | 5 | 0.2%
99856 | 5 | 0.2%
69149 | 5 | 0.2%
67106 | 5 | 0.2%
114211 | 5 | 0.2%
298537 | 5 | 0.2%
151760 | 5 | 0.2%
335423 | 5 | 0.2%
Other values (584) | 2831 | 97.9%

Smallest values

Value | Count | Frequency (%)
63215 | 5 | 0.2%
63651 | 5 | 0.2%
63792 | 5 | 0.2%
64609 | 5 | 0.2%
64819 | 4 | 0.1%

Largest values

Value | Count | Frequency (%)
8550405 | 5 | 0.2%
3971896 | 5 | 0.2%
2720556 | 5 | 0.2%
2298628 | 5 | 0.2%
1567442 | 5 | 0.2%

Number of Veterans
Real number (ℝ≥0)

Distinct count: 577
Unique (%): 20.0%
Missing: 13
Missing (%): 0.4%
Infinite: 0
Infinite (%): 0.0%
Mean: 9367.832522585128
Minimum: 416.0
Maximum: 156961.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 416
5-th percentile: 1990
Q1: 3739
median: 5397
Q3: 9368
95-th percentile: 29511
Maximum: 156961
Range: 156545
Interquartile range (IQR): 5629

Descriptive statistics

Standard deviation: 13211.21992
Coefficient of variation (CV): 1.410274991
Kurtosis: 39.85381815
Mean: 9367.832523
Median Absolute Deviation (MAD): 2281
Skewness: 5.295923255
Sum: 26960622
Variance: 174536331.9

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
4211 | 10 | 0.3%
3397 | 10 | 0.3%
3647 | 10 | 0.3%
3116 | 10 | 0.3%
3063 | 10 | 0.3%
5532 | 10 | 0.3%
5714 | 10 | 0.3%
5204 | 10 | 0.3%
3027 | 9 | 0.3%
3404 | 9 | 0.3%
Other values (567) | 2780 | 96.2%
(Missing) | 13 | 0.4%

Smallest values

Value | Count | Frequency (%)
416 | 3 | 0.1%
629 | 5 | 0.2%
693 | 4 | 0.1%
705 | 5 | 0.2%
724 | 5 | 0.2%

Largest values

Value | Count | Frequency (%)
156961 | 5 | 0.2%
109089 | 5 | 0.2%
92489 | 5 | 0.2%
85417 | 5 | 0.2%
75432 | 5 | 0.2%

Foreign-born
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 587
Unique (%): 20.4%
Missing: 13
Missing (%): 0.4%
Infinite: 0
Infinite (%): 0.0%
Mean: 40653.598679638635
Minimum: 861.0
Maximum: 3212500.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 861
5-th percentile: 3215
Q1: 9224
median: 18822
Q3: 33971.75
95-th percentile: 109222.15
Maximum: 3212500
Range: 3211639
Interquartile range (IQR): 24747.75

Descriptive statistics

Standard deviation: 155749.1037
Coefficient of variation (CV): 3.831127101
Kurtosis: 310.3878352
Mean: 40653.59868
Median Absolute Deviation (MAD): 11101
Skewness: 16.35579487
Sum: 117001057
Variance: 2.425778329e+10

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
5757 | 10 | 0.3%
13409 | 10 | 0.3%
9735 | 5 | 0.2%
13253 | 5 | 0.2%
22958 | 5 | 0.2%
6275 | 5 | 0.2%
18451 | 5 | 0.2%
10516 | 5 | 0.2%
6123 | 5 | 0.2%
33258 | 5 | 0.2%
Other values (577) | 2818 | 97.5%
(Missing) | 13 | 0.4%

Smallest values

Value | Count | Frequency (%)
861 | 5 | 0.2%
1058 | 5 | 0.2%
1062 | 4 | 0.1%
1224 | 5 | 0.2%
1531 | 5 | 0.2%

Largest values

Value | Count | Frequency (%)
3212500 | 5 | 0.2%
1485425 | 5 | 0.2%
696210 | 5 | 0.2%
573463 | 5 | 0.2%
401493 | 5 | 0.2%

Average Household Size
Real number (ℝ≥0)

Distinct count: 161
Unique (%): 5.6%
Missing: 16
Missing (%): 0.6%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.742542608695652
Minimum: 2.0
Maximum: 4.98
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 2
5-th percentile: 2.22
Q1: 2.43
median: 2.65
Q3: 2.95
95-th percentile: 3.58
Maximum: 4.98
Range: 2.98
Interquartile range (IQR): 0.52

Descriptive statistics

Standard deviation: 0.4332910879
Coefficient of variation (CV): 0.1579888263
Kurtosis: 2.861908247
Mean: 2.742542609
Median Absolute Deviation (MAD): 0.24
Skewness: 1.409563879
Sum: 7884.81
Variance: 0.1877411669

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
2.4 | 78 | 2.7%
2.72 | 68 | 2.4%
2.97 | 55 | 1.9%
2.64 | 54 | 1.9%
2.39 | 54 | 1.9%
2.41 | 50 | 1.7%
2.68 | 50 | 1.7%
2.52 | 49 | 1.7%
2.73 | 48 | 1.7%
2.55 | 45 | 1.6%
Other values (151) | 2324 | 80.4%

Smallest values

Value | Count | Frequency (%)
2 | 5 | 0.2%
2.06 | 5 | 0.2%
2.08 | 10 | 0.3%
2.1 | 4 | 0.1%
2.11 | 5 | 0.2%

Largest values

Value | Count | Frequency (%)
4.98 | 5 | 0.2%
4.78 | 4 | 0.1%
4.58 | 5 | 0.2%
4.57 | 3 | 0.1%
4.43 | 4 | 0.1%

State Code
Categorical

HIGH CORRELATION

Distinct count: 49
Unique (%): 1.7%
Missing: 0
Missing (%): 0.0%
Memory size: 22.6 KiB

Value | Count | Frequency (%)
CA | 676 | 23.4%
TX | 273 | 9.4%
FL | 222 | 7.7%
IL | 91 | 3.1%
WA | 85 | 2.9%
CO | 80 | 2.8%
AZ | 80 | 2.8%
MI | 79 | 2.7%
VA | 70 | 2.4%
NC | 70 | 2.4%
Other values (39) | 1165 | 40.3%

Length

Max length: 2
Median length: 2
Mean length: 2
Min length: 2

Race
Categorical

Distinct count: 5
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 22.6 KiB

Value | Count | Frequency (%)
Hispanic or Latino | 596 | 20.6%
White | 589 | 20.4%
Black or African-American | 584 | 20.2%
Asian | 583 | 20.2%
American Indian and Alaska Native | 539 | 18.6%

Length

Max length: 33
Median length: 18
Mean length: 16.94050502
Min length: 5

Count
Real number (ℝ≥0)

Distinct count: 2785
Unique (%): 96.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 48963.77447250087
Minimum: 98
Maximum: 3835726
Zeros: 0
Zeros (%): 0.0%
Memory size: 22.6 KiB

Quantile statistics

Minimum: 98
5-th percentile: 778.5
Q1: 3435
median: 13780
Q3: 54447
95-th percentile: 162670.5
Maximum: 3835726
Range: 3835628
Interquartile range (IQR): 51012

Descriptive statistics

Standard deviation: 144385.5886
Coefficient of variation (CV): 2.94882472
Kurtosis: 246.5728171
Mean: 48963.77447
Median Absolute Deviation (MAD): 12231
Skewness: 12.97352625
Sum: 141554272
Variance: 2.084719819e+10

Histogram with fixed size bins (bins=10)

Most frequent values

Value | Count | Frequency (%)
881 | 3 | 0.1%
1343 | 3 | 0.1%
6547 | 3 | 0.1%
1615 | 3 | 0.1%
251 | 3 | 0.1%
535 | 3 | 0.1%
876 | 3 | 0.1%
713 | 3 | 0.1%
2452 | 2 | 0.1%
11815 | 2 | 0.1%
Other values (2775) | 2863 | 99.0%

Smallest values

Value | Count | Frequency (%)
98 | 1 | < 0.1%
128 | 1 | < 0.1%
158 | 1 | < 0.1%
182 | 1 | < 0.1%
203 | 1 | < 0.1%

Largest values

Value | Count | Frequency (%)
3835726 | 1 | < 0.1%
2485125 | 1 | < 0.1%
2192248 | 1 | < 0.1%
2177650 | 1 | < 0.1%
1936732 | 1 | < 0.1%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
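
The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pandas-profiling implementation; the function name `pearson_r` is chosen for the example, and the sample values are the Male Population and Female Population figures from the first five rows shown in the Sample section of this report.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of products and squares are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# Male and female population, first five sample rows of this report.
male = [40601, 44129, 38040, 88127, 138040]
female = [41862, 49500, 46799, 87105, 143873]
print(pearson_r(male, female))
```

On these five rows r comes out close to +1, which is in line with the high-correlation warning raised for the two variables.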

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at capturing nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
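
A minimal sketch of this procedure, assuming tied values receive the average of the ranks they span (a common convention that the text above does not specify). The helper names `average_ranks` and `spearman_rho` are illustrative and are not taken from pandas-profiling.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank variables of xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# A monotonic but nonlinear relationship (y = x cubed) still gives rho of about 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))
```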

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
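
The pair-counting definition above corresponds to the variant usually called tau-a. The sketch below implements exactly that definition; it is an assumption that this matches the text, and statistical libraries often use tau-b instead, which adjusts the denominator for ties. The function name `kendall_tau_a` is illustrative.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
        # Pairs tied in x or y count as neither concordant nor discordant.
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Two concordant pairs and one discordant pair out of three.
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))
```

The loop visits every pair, so the cost grows quadratically with the number of observations; that is acceptable for illustration but slow for large columns.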

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
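
A minimal sketch of a bias-corrected Cramér's V, following the correction as it is commonly stated: the chi-squared statistic divided by n is reduced by (k-1)(r-1)/(n-1), and the numbers of rows and columns are shrunk accordingly. The function name `cramers_v_corrected` is illustrative, and the code has not been checked against the pandas-profiling source, so treat it as an approximation of the idea rather than the tool's exact computation.

```python
import math
from collections import Counter

def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2_corrected = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corrected = r - (r - 1) ** 2 / (n - 1)
    k_corrected = k - (k - 1) ** 2 / (n - 1)
    denominator = min(r_corrected - 1, k_corrected - 1)
    if denominator <= 0:
        return 0.0  # a variable with a single category carries no association
    return math.sqrt(phi2_corrected / denominator)

# State and State Code determine each other, so V should be close to 1.
states = ["California", "Texas", "Florida"] * 20
codes = ["CA", "TX", "FL"] * 20
print(cramers_v_corrected(states, codes))
```

A one-to-one pairing such as State and State Code yields a value of about 1, consistent with the high-correlation warning for those two variables, while independent labels yield 0.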

Missing values

Sample

First rows

Index | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count
0 | Silver Spring | Maryland | 33.8 | 40601.0 | 41862.0 | 82463 | 1562.0 | 30908.0 | 2.60 | MD | Hispanic or Latino | 25924
1 | Quincy | Massachusetts | 41.0 | 44129.0 | 49500.0 | 93629 | 4147.0 | 32935.0 | 2.39 | MA | White | 58723
2 | Hoover | Alabama | 38.5 | 38040.0 | 46799.0 | 84839 | 4819.0 | 8229.0 | 2.58 | AL | Asian | 4759
3 | Rancho Cucamonga | California | 34.5 | 88127.0 | 87105.0 | 175232 | 5821.0 | 33878.0 | 3.18 | CA | Black or African-American | 24437
4 | Newark | New Jersey | 34.6 | 138040.0 | 143873.0 | 281913 | 5829.0 | 86253.0 | 2.73 | NJ | White | 76402
5 | Peoria | Illinois | 33.1 | 56229.0 | 62432.0 | 118661 | 6634.0 | 7517.0 | 2.40 | IL | American Indian and Alaska Native | 1343
6 | Avondale | Arizona | 29.1 | 38712.0 | 41971.0 | 80683 | 4815.0 | 8355.0 | 3.18 | AZ | Black or African-American | 11592
7 | West Covina | California | 39.8 | 51629.0 | 56860.0 | 108489 | 3800.0 | 37038.0 | 3.56 | CA | Asian | 32716
8 | O'Fallon | Missouri | 36.0 | 41762.0 | 43270.0 | 85032 | 5783.0 | 3269.0 | 2.77 | MO | Hispanic or Latino | 2583
9 | High Point | North Carolina | 35.5 | 51751.0 | 58077.0 | 109828 | 5204.0 | 16315.0 | 2.65 | NC | Asian | 11060

Last rows

Index | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count
2881 | Gulfport | Mississippi | 35.1 | 33108.0 | 38764.0 | 71872 | 6646.0 | 3072.0 | 2.54 | MS | White | 42870
2882 | Davis | California | 26.3 | 33493.0 | 34163.0 | 67656 | 2176.0 | 13997.0 | 2.69 | CA | American Indian and Alaska Native | 779
2883 | Los Angeles | California | 35.0 | 1958998.0 | 2012898.0 | 3971896 | 85417.0 | 1485425.0 | 2.86 | CA | Black or African-American | 404868
2884 | Mount Vernon | New York | 38.5 | 31876.0 | 36745.0 | 68621 | 2064.0 | 23777.0 | 2.85 | NY | Hispanic or Latino | 9446
2885 | Lynchburg | Virginia | 28.7 | 38614.0 | 41198.0 | 79812 | 4322.0 | 4364.0 | 2.48 | VA | White | 53727
2886 | Stockton | California | 32.5 | 150976.0 | 154674.0 | 305650 | 12822.0 | 79583.0 | 3.16 | CA | American Indian and Alaska Native | 19834
2887 | Southfield | Michigan | 41.6 | 31369.0 | 41808.0 | 73177 | 4035.0 | 4011.0 | 2.27 | MI | American Indian and Alaska Native | 983
2888 | Indianapolis | Indiana | 34.1 | 410615.0 | 437808.0 | 848423 | 42186.0 | 72456.0 | 2.53 | IN | White | 553665
2889 | Somerville | Massachusetts | 31.0 | 41028.0 | 39306.0 | 80334 | 2103.0 | 22292.0 | 2.43 | MA | American Indian and Alaska Native | 374
2890 | Coral Springs | Florida | 37.2 | 63316.0 | 66186.0 | 129502 | 4724.0 | 38552.0 | 3.17 | FL | White | 90896